perf: improve performance of update metrics #1329

wForget · 2025-01-23T03:43:52Z

Which issue does this PR close?

Closes #1328.

Rationale for this change

Improve performance of update metrics

What changes are included in this PR?

Define a NativeMetricNode proto type so that all metric nodes are passed at once, avoiding iterative JNI calls.
Call update metrics when releasing the plan to reduce the number of calls.

How are these changes tested?

After this change:

SQL metrics are displayed correctly (screenshot not preserved in this copy).

CPU profile (image not preserved in this copy).

codecov-commenter · 2025-01-23T05:12:30Z

Codecov Report

Attention: Patch coverage is 90.90909% with 1 line in your changes missing coverage. Please review.

Project coverage is 39.06%. Comparing base (f09f8af) to head (71394ae).
Report is 19 commits behind head on main.

Files with missing lines	Patch %	Lines
...a/org/apache/spark/sql/comet/CometMetricNode.scala	83.33%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@              Coverage Diff              @@
##               main    #1329       +/-   ##
=============================================
- Coverage     56.12%   39.06%   -17.07%     
- Complexity      976     2071     +1095     
=============================================
  Files           119      263      +144     
  Lines         11743    60742    +48999     
  Branches       2251    12909    +10658     
=============================================
+ Hits           6591    23729    +17138     
- Misses         4012    32530    +28518     
- Partials       1140     4483     +3343

wForget · 2025-01-23T06:07:52Z

Although the proportion of update metrics in the CPU profile has been greatly reduced, the TPC-DS/TPC-H benchmarks on a small data set have not improved.

native/core/src/execution/metrics/utils.rs

andygrove · 2025-01-23T14:38:55Z

@mbutrovich may be interested in reviewing this as well

andygrove · 2025-01-23T14:39:49Z

native/core/src/execution/jni_api.rs

@@ -508,9 +505,6 @@ pub unsafe extern "system" fn Java_org_apache_comet_Native_executePlan(
            let next_item = exec_context.stream.as_mut().unwrap().next();
            let poll_output = exec_context.runtime.block_on(async { poll!(next_item) });

-            // Update metrics
-            update_metrics(&mut env, exec_context)?;


I wonder if we should add a config so that we can choose between frequent metrics updates vs just updating once the query completes. It can sometimes be helpful to see live metrics.

mbutrovich · 2025-01-23

Per-batch is probably always overkill. For long-running jobs, is there a period that makes sense? It looks like Spark History defaults to 10s.

andygrove · 2025-01-23

I do like the idea of updating metrics every N seconds.

mbutrovich · 2025-01-23

I think checking a coarse-grained clock (i.e., CLOCK_MONOTONIC_COARSE) to see whether N seconds have elapsed before producing updated metrics would be a reasonable compromise between performance impact and fresh metrics.

parthchandra · 2025-01-29

I also like the idea of updating every N seconds. One good reason for updating frequently is to keep the live UI up to date.

wForget · 2025-02-05

@mbutrovich Thank you for your idea, it sounds great to me. I will try to do that later.
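
For readers following the thread, the "update every N seconds" idea can be sketched roughly as below. This is only an illustration, not the code in this PR: `ExecContext`, `maybe_update_metrics`, `release`, and the `updates_sent` counter are hypothetical stand-ins, and the `Option<Duration>` type for the interval is an assumption. The field names `metrics_update_interval` and `metrics_last_update_time` follow the later diff in this conversation. The current time is passed in as a parameter so the logic can be exercised without sleeping.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the per-plan execution context.
struct ExecContext {
    /// `None` means: no periodic updates, only the final one on release (assumption).
    metrics_update_interval: Option<Duration>,
    metrics_last_update_time: Instant,
    /// Counts how many times metrics were pushed; illustration only.
    updates_sent: u32,
}

impl ExecContext {
    fn new(metrics_update_interval: Option<Duration>, now: Instant) -> Self {
        Self {
            metrics_update_interval,
            metrics_last_update_time: now,
            updates_sent: 0,
        }
    }

    /// Called once per batch. Pushes metrics only if the configured
    /// interval has elapsed since the last push. Returns whether it pushed.
    fn maybe_update_metrics(&mut self, now: Instant) -> bool {
        if let Some(interval) = self.metrics_update_interval {
            if now.duration_since(self.metrics_last_update_time) >= interval {
                self.update_metrics();
                self.metrics_last_update_time = now;
                return true;
            }
        }
        false
    }

    /// Placeholder for the single call that sends the whole metric tree.
    fn update_metrics(&mut self) {
        self.updates_sent += 1;
    }

    /// Always push once more when the plan is released, so final values are visible.
    fn release(&mut self) {
        self.update_metrics();
    }
}
```

With a 10 second interval, batches arriving at 3s, 10s, and 15s after plan creation would trigger one periodic push (at 10s), and `release` would add the final one.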

andygrove · 2025-01-23T14:55:46Z

Based on a single run of TPC-H @ 100GB, I see approximately 2% improvement in TPC-H (325s on main vs 318s with this PR)

wForget · 2025-02-05T01:50:27Z

@andygrove @mbutrovich @parthchandra Thank you for your review and sorry for the late reply. I have just finished my Chinese New Year holiday and will continue this work later.

andygrove · 2025-02-05T16:20:50Z

native/core/src/execution/metrics/utils.rs

+    spark_plan.children().iter().for_each(|child_plan| {
+        let child_node = to_native_metric_node(child_plan).unwrap();
+        native_metric_node.children.push(child_node);
+    });


If you change this to a for loop rather than using for_each then we can use ? to handle any error condition.

Suggested change

-    spark_plan.children().iter().for_each(|child_plan| {
-        let child_node = to_native_metric_node(child_plan).unwrap();
-        native_metric_node.children.push(child_node);
-    });
+    for child_plan in spark_plan.children() {
+        let child_node = to_native_metric_node(child_plan)?;
+        native_metric_node.children.push(child_node);
+    }

wForget · 2025-02-06

Thank you for your suggestion, changed. I am not familiar with Rust yet, and I hope to learn Rust by contributing to Comet. 😁
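
As background on the suggestion above: inside a closure passed to `for_each`, `?` cannot propagate an error out of the enclosing function, because the closure itself returns `()`. That is why the original code had to `unwrap()`, which panics on failure. A plain `for` loop runs in the function body, so `?` returns early from the function. The sketch below shows the pattern on a toy tree. `Plan`, `MetricNode`, `to_metric_node`, and the `String` error type are hypothetical stand-ins, not the types used in Comet.

```rust
/// Hypothetical stand-in for a plan node with children.
struct Plan {
    name: String,
    children: Vec<Plan>,
}

/// Hypothetical stand-in for the tree-shaped metric node.
#[derive(Debug, PartialEq)]
struct MetricNode {
    name: String,
    children: Vec<MetricNode>,
}

/// Converts a plan tree into a metric tree, failing on any unnamed node.
fn to_metric_node(plan: &Plan) -> Result<MetricNode, String> {
    if plan.name.is_empty() {
        return Err("plan node has no name".to_string());
    }
    let mut node = MetricNode {
        name: plan.name.clone(),
        children: Vec::new(),
    };
    for child in &plan.children {
        // `?` returns early from `to_metric_node` on the first failing child
        // instead of panicking, which `unwrap()` inside `for_each` would do.
        let child_node = to_metric_node(child)?;
        node.children.push(child_node);
    }
    Ok(node)
}
```

An equivalent iterator form is to `map` each child through the conversion and `collect` into a `Result<Vec<_>, _>`, which also stops at the first error.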

mbutrovich · 2025-02-06T05:08:42Z

native/core/src/execution/jni_api.rs

@@ -233,11 +242,12 @@ pub unsafe extern "system" fn Java_org_apache_comet_Native_createPlan(
            stream: None,
            runtime,
            metrics,
+            metrics_update_interval,
+            metrics_last_update_time: Instant::now(),


https://github.com/jedisct1/rust-coarsetime

@andygrove thoughts on a coarse time crate? The overhead of the clock_gettime() calls underneath Instant::now() can really add up. Maybe it's a premature optimization, but I also don't want a "death by 1000 cuts" scenario with gettime() all over the place.

I ran coarsetime's benchmark on my laptop:

coarsetime_now(): 126.93 M/s
coarsetime_recent(): 340.32 M/s
coarsetime_elapsed(): 142.64 M/s
coarsetime_since_recent(): 340.34 M/s
stdlib_now(): 51.37 M/s
stdlib_elapsed(): 42.42 M/s

I'm a bit stunned that Rust's stdlib doesn't provide a nice way to get coarse time on its own, since the performance difference can be quite large and a lot of tasks don't need nanosecond precision.
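
To illustrate why the "recent" variants in the benchmark above come out fastest, here is a std-only sketch of the general idea: read the clock once, cache the reading, and let hot paths compare against the cached value. This is not the coarsetime API and not a proposal for this PR. `CachedClock` and its methods are hypothetical names, and the sketch does not use a coarse OS clock at all. It only shows the caching pattern.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper that caches one clock reading.
struct CachedClock {
    recent: Instant,
}

impl CachedClock {
    fn new() -> Self {
        Self { recent: Instant::now() }
    }

    /// Performs a real clock read and stores it. Call this occasionally,
    /// for example once per batch or from a background timer.
    fn refresh(&mut self) -> Instant {
        self.recent = Instant::now();
        self.recent
    }

    /// Returns the cached reading without touching the clock.
    fn recent(&self) -> Instant {
        self.recent
    }

    /// Time between `earlier` and the cached reading (zero if `earlier` is later).
    fn since_recent(&self, earlier: Instant) -> Duration {
        self.recent.saturating_duration_since(earlier)
    }
}
```

The trade-off is staleness: the cached value is only as fresh as the last `refresh`, which is acceptable when the question is whether roughly N seconds have passed.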

wForget changed the title ~~Improve performance of update metrics~~ perf: improve performance of update metrics Jan 23, 2025

wForget added 2 commits January 23, 2025 12:07

Improve performance of update metrics

e2c0178

fix style

958476b

wForget force-pushed the COMET-1328 branch from 590fb65 to 958476b Compare January 23, 2025 04:08

fix

8c5724d

wForget force-pushed the COMET-1328 branch from a5df4f1 to 8c5724d Compare January 23, 2025 05:34

fix

642c737

andygrove reviewed Jan 23, 2025

andygrove reviewed Jan 23, 2025

wForget added 2 commits February 5, 2025 13:01

add update metrics interval

b52b6ae

fix style

c869bbd

wForget marked this pull request as ready for review February 5, 2025 06:46

andygrove reviewed Feb 5, 2025

address comment

71394ae

mbutrovich reviewed Feb 6, 2025
